plae Analysis Examples
Welcome
This document contains some simple examples for how to use the advanced filtering options in https://plae.nei.nih.gov and a brief introduction to using the Seurat file to do custom tests.
Gene Focused
By “gene focused” we mean that:
- You have a gene you are interested in
- You want to learn more about where / when / what it is expressed in
In silico In Situ
For those who are most comfortable with stained in situ slices of the retina this visualization may be useful. The major cell types of the retina are laid on in rough anatomical positioning. The cell types are colored by intensity, with the brighter colors meaning that the gene is more highly expressed in that cell type. As an example we show expression of RHO (rhodopsin, a rod marker) and RPE65 (RPE marker).
UMAP - Table
If you are curious about a gene, then there are several ways you can learn about its retinal cell type expression patterning. We will use ATOH7, a transcription factor that regulates retinal ganglion development as our example gene.
The UMAP view is a two dimensional representation of the individual cells in the scEiaD. Cells that are closer together have more related gene expression profiles (and thus are likely to be similar cell types).
Let’s go the UMAP - Table viewer in plae:
Viz -> UMAP - Tables
Dark gray are cells which have no detectable ATOH7.
Yellow is the highest expression and dark purple is the lowest expression.
Show highest expressing cells
What if we want to see which cells have the highest expression? We can use the “Filter Gene Expression” slider to only show cells with expression above a log2(expression) value.
We see that the highest expressing cells are in the “center” before the branching happens.
Species Filtering
By default plae shows data for all organisms in the database (human, mouse, macaque).
If we only want to see ATOH7 expression in human data, then that is very easy with the powerful “Scatter Filter Category” and “Gene Filter On” sections.
Table Information
While the UMAP view is cool looking, it can’t show you everything….what if we want to know what kind of cells are expressing ATOH7?
We can have quantified information on where ATOH7 is expressed by Cell Type (predicted) (this is our machine learned cell type labels) and organism.
We see that about 50% of the mouse and human neurogenic cell type express ATOH7. In raw counts that is 16,449 of 28,811 total mouse neurogenic cells. They have an average expression of 1.86. You can sort or filter the table based on queries. If you wanted to see ATOH7 expression in the RGCs this is trivial to do by typing in the box below.
This shows us that ATOH7 expression seems to be dropping in the maturing/mature RGCs (and is much lower in the macaque) relative to the neurogenic population.
Study filtering
As scEiaD is constructed from publicly available datasets, you can also filter the data to only show results from a specific paper. This may be useful if you using the results from that paper and want to check or confirm a finding.
You can see information about the papers / studies in scEiaD by using the adjacent “Make Meta Table” section as follows:
We see that the Clark et al. 2019 study did Smart-seq2 and 10X across many developmental time points in mouse. They study_accession ID is SRP158081. We can use this ID to look at ATOH7 expression only in this study in both the UMAP view and the table view
Expression Plot
As we have a huge number of studies and samples, we can use this (for single cell data) unusual view: a boxplot! We can see how ATOH7 expression changes across celltype and study.
How do we get here?
Make a plot….that shows nothing?
We’ve entered ATOH7 as the gene to plot (1). We are faceting (splitting the plot into separate sub-plots) on Cell Type (predict) (2). We are coloring the data points by study_accession (each study’s average gene expression across the Cell Type (predict) is plotted separately) (3). But we see … nothing. Why?
That is because the Plot Height (400) is not high enough. The text is prioritized over the data, so they are hidden. As it is extremely difficult to “auto” pick the correct height, it was more straightforward to have the user pick it. Usually a value around 700 - 1200 will give a reasonable view.
Useable plot
So yes, now we can see the data.
Some cones have ATOH7 expression?
So each point is an independent study. We see high ATOH7 expression in the neurogenic population, across many studies. But we also see one of studies with ATOH7 expression in cones. The legend shows which colors correspond to which study.
One of the studies is a bright pink…probably SRP200599. We can confirm that by replotting the data with a filter that only shows study_accession SRP200599.
Yep, that is it. We can jump to the UMAP - Table view to pull up the metadata we have extracted about the study.
This is from Buenaventura et al. and is a study that enriched early mouse cones. Some work suggests that loss of ATOH7 inhibits cone specification. These cones may be “early” cones or late neurogenic cells that are developing into cones.
Cell Type Focused
If you want to get a sense about what is present in scEiaD, then there are several tools you can use. For our example, we will be starting with the rods.
How many rods do we have?
These are published labels
How many rods do we have after the machine learning?
How many rods (predicted) do we have across organism?
How many rods (predicted) we we have across organism and study?
What genes are differentially expressed in the Rods?
And what does ROM1 expression look like in the UMAP?
Very high expression in the rods (and expressed in other cells too)
Advanced Stuff - Analysis in R
We provide the full data as seurat (v3) or anndata (scanpy) objects you can download for boutique analysis. Here we demonstrate how you can use the Seurat object to run a quick custom diff test on a sub-population of neurogenic cells.
library(Seurat)## Attaching SeuratObject
library(tidyverse)## Registered S3 method overwritten by 'cli':
## method from
## print.boxx spatstat.geom
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.5 ✓ dplyr 1.0.3
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
#system('wget -O ~/data/scEiaD/scEiaD_all_seurat_v3.Rdata http://hpc.nih.gov/~mcgaugheyd/scEiaD/2021_03_17/scEiaD_all_seurat_v3.Rdata')
load('~/data/scEiaD/scEiaD_all_seurat_v3.Rdata')ID neurogenic cells in different clusters
scEiaD_droplet@meta.data %>%
group_by(cluster, CellType_predict) %>%
summarise(Count = n()) %>%
mutate(Perc = (Count / sum(Count)) * 100) %>%
filter(Perc > 5) %>%
filter(CellType_predict == 'Neurogenic Cells') %>%
arrange(-Perc)## `summarise()` has grouped output by 'cluster'. You can override using the `.groups` argument.
## # A tibble: 4 x 4
## # Groups: cluster [4]
## cluster CellType_predict Count Perc
## <dbl> <chr> <int> <dbl>
## 1 41 Neurogenic Cells 13633 80.8
## 2 42 Neurogenic Cells 15566 55.2
## 3 25 Neurogenic Cells 1100 5.65
## 4 9 Neurogenic Cells 1031 5.54
The most common is cluster 41 and 42 (by both count and proportion that is Neurogenic)
Compare cluster 41 - Neurogenic vs cluster 42 - Neurogenic
This will take a few minutes to run. This is why we cannot offer on-demand custom diff testing as it takes at least a few minutes to run and the whole app shuts down for everyone while it computes.
# Create new column with a pasted together cluster ID and a CellType
scEiaD_droplet <- AddMetaData(scEiaD_droplet,
metadata =
paste0(
scEiaD_droplet@meta.data$cluster,
'_',
scEiaD_droplet@meta.data$CellType_predict
),
col.name = 'clusterCT' )
# tell Seurat this new column is the default identity
Idents(scEiaD_droplet) <- scEiaD_droplet@meta.data$clusterCT
diff_test <- FindMarkers(scEiaD_droplet, ident.1 = '41_Neurogenic Cells', ident.2 = '42_Neurogenic Cells')
diff_test %>% arrange(-avg_log2FC) %>% head(5)## p_val avg_log2FC pct.1 pct.2 p_val_adj
## ENSG00000131747 0 2.988279 0.609 0.059 0
## ENSG00000171848 0 2.719401 0.560 0.078 0
## ENSG00000117724 0 2.596955 0.529 0.069 0
## ENSG00000170312 0 2.557732 0.534 0.039 0
## ENSG00000188486 0 2.549683 0.666 0.225 0
We see the top hit (by log fold change) that is over expressed in cluster6-Neurogenic is ENSG00000131747 (TOP2A), which is a marker of cell proliferation.
ID RPE cells in different clusters
scEiaD_droplet@meta.data %>%
group_by(cluster, CellType_predict) %>%
summarise(Count = n()) %>%
mutate(Perc = (Count / sum(Count)) * 100) %>%
filter(Perc > 5) %>%
filter(CellType_predict == 'RPE') %>%
arrange(-Perc)## `summarise()` has grouped output by 'cluster'. You can override using the `.groups` argument.
## # A tibble: 2 x 4
## # Groups: cluster [2]
## cluster CellType_predict Count Perc
## <dbl> <chr> <int> <dbl>
## 1 34 RPE 1485 83.0
## 2 47 RPE 590 13.1
Compare cluster 34 - RPE vs cluster 47 - RPE
# Create new column with a pasted together cluster ID and a CellType
scEiaD_droplet <- AddMetaData(scEiaD_droplet,
metadata =
paste0(
scEiaD_droplet@meta.data$cluster,
'_',
scEiaD_droplet@meta.data$CellType_predict
),
col.name = 'clusterCT' )
# tell Seurat this new column is the default identity
Idents(scEiaD_droplet) <- scEiaD_droplet@meta.data$clusterCT
diff_test <- FindMarkers(scEiaD_droplet, ident.1 = '34_RPE', ident.2 = '47_RPE')
diff_test %>% arrange(avg_log2FC) %>% head(20)## p_val avg_log2FC pct.1 pct.2 p_val_adj
## ENSG00000167526 1.005269e-299 -5.109230 0.048 0.824 1.584204e-295
## ENSG00000142676 0.000000e+00 -4.703629 0.049 0.849 0.000000e+00
## ENSG00000143947 3.896901e-298 -4.672049 0.071 0.856 6.141126e-294
## ENSG00000162244 1.115901e-279 -4.631504 0.081 0.827 1.758549e-275
## ENSG00000137154 0.000000e+00 -4.483151 0.040 0.842 0.000000e+00
## ENSG00000231500 2.541170e-264 -4.397065 0.150 0.864 4.004630e-260
## ENSG00000137818 4.616330e-281 -4.380650 0.063 0.822 7.274875e-277
## ENSG00000109475 3.207695e-266 -4.363803 0.147 0.873 5.055007e-262
## ENSG00000108107 4.050680e-290 -4.353569 0.086 0.858 6.383466e-286
## ENSG00000142937 1.210743e-238 -4.300096 0.275 0.895 1.908010e-234
## ENSG00000198804 2.566833e-205 -4.228349 0.292 0.856 4.045073e-201
## ENSG00000026025 1.389353e-249 -4.183896 0.247 0.902 2.189481e-245
## ENSG00000125968 6.820280e-260 -4.067235 0.059 0.773 1.074808e-255
## ENSG00000132386 6.088435e-254 -3.887429 0.057 0.795 9.594764e-250
## ENSG00000084207 2.509505e-287 -3.884232 0.061 0.858 3.954729e-283
## ENSG00000105193 5.532073e-287 -3.864908 0.058 0.822 8.717994e-283
## ENSG00000111640 1.948237e-283 -3.864453 0.045 0.808 3.070226e-279
## ENSG00000185664 9.607921e-275 -3.834617 0.118 0.893 1.514112e-270
## ENSG00000185633 2.086332e-174 -3.825968 0.010 0.495 3.287851e-170
## ENSG00000145741 1.740524e-276 -3.786346 0.079 0.825 2.742892e-272
One of the first non-ribosome or mitochondrial gene is vimentin, which has been previously identified as over-expressed in cultured and proliferating RPE (https://pubmed.ncbi.nlm.nih.gov/1700982/). The RPE in cluster 47 are iPSC-based.
Check RPE65, TTR, TYRP1 and Vimentin expression across the two RPE cluster (and RGC)
RPE65, TTR, and TYRP1 are markers for functional RPE and are high in both groups of RPE (and as a quick control, virtually all off in RGC). Vimentin is much higher in the iPSC RPE cluster (37).
scEiaD_droplet__RPE <- subset(scEiaD_droplet, idents = c('34_RPE','47_RPE', '51_Retinal Ganglion Cells', '7_RPCs'))
VlnPlot(scEiaD_droplet__RPE, features = c('ENSG00000116745', 'ENSG00000107165', 'ENSG00000118271', 'ENSG00000026025'))